Abstract

Technological development produces terabytes of data generated by human
activity in space and time. This enormous amount of data, often called big
data, is becoming crucial for delivering new insights to decision makers.
It contains behavioral information on different types of human activity
influenced by many external factors such as geographic information and
weather forecasts. Early recognition and prediction of these human
behaviors are of great importance in many societal applications such as
health care, risk management and urban planning. In this paper, we
identify relevant geographical areas based on their categories of human
activity (e.g., working and shopping), which are derived from geographic
information (i.e., OpenStreetMap). We use spectral clustering followed by
the k-means clustering algorithm, based on a TF-IDF cosine similarity
metric. We evaluate the quality of the observed clusters using silhouette
coefficients, which are estimated from the similarities of the temporal
patterns of mobile communication activity. The area clusters are further
used to explain typical or exceptional communication activities. We
demonstrate the study using a real dataset containing 1 million Call
Detail Records. This type of analysis and its applications are important
for analyzing the dependency of human behaviors on external factors, and
for uncovering hidden relationships, unknown correlations and other useful
information that can support decision-making.
Index terms: telecommunication dataset, human behavior, cell phone data
records, activity recognition, knowledge management, clustering and
classification


1 Introduction 

The extensive penetration of digital technologies into everyday life now
creates vast amounts of data related to different types of human activity.
When available for research purposes, this creates an unprecedented
opportunity for understanding human society directly from its digital
traces. An impressive number of papers leverage such data for studying
human behavior, including mobile phone records [?], vehicle GPS traces
[?], social media posts [?] and bank card transactions [?]. Mobile phone
data is among the most commonly used of these data sources.
Such data allows us to understand human behaviors and social relationships
by investigating the influence of context factors on social dynamics.
Potential applications of this research include context-aware application
systems and "smart city" applications that provide decision support for
stakeholders in areas such as urban and transport planning, tourism and
event analysis, emergency response, health improvement, community
understanding, economic indicators and others. Current research [?] notes
that with pure CDR data it is possible to identify human behaviors, but
the results suffer from the heterogeneity, uncertainty and complexity of
raw datasets, and from the lack of qualitative content in the data itself
that could help to infer human behaviors. Prior work [?] identifies that
Points of Interest (POIs) provide a good proxy for predicting the content
of human activities in each area, and thus for identifying the activities
people are more likely to perform. This is much more effective if we
combine POIs with mobile phone data records, which can then be associated
with likely human activities: for example, a person is probably looking
for food if a phone call is located in or close to a restaurant. In this
paper, we concentrate on the characterization and clustering of
geographical areas based on the categories of human activity, in order to
identify relevant areas. The model proposed in [?] is used to extract
top-level human activities from an open geographical data source. We use
spectral clustering with the eigengap heuristic followed by k-means
clustering, and an intrinsic method to evaluate the quality of the
clusters. In each area cluster, we contextually enrich the mobile phone
data records with the categories of human activity and then analyze and
identify the standard or exceptional (divergent) temporal patterns of
communication activity. This further opens a discussion on understanding
various types of relationships between environment and human behavior.
The paper is structured as follows. Section 2 reviews related work, and
the methodology is described in Section 4. We present and discuss the
results in Section 5. Finally, we summarize the discussion in Section 6.


2 Related works 

Clustering approaches (Han & Kamber [?]) such as k-means, k-medoids and
self-organizing maps group similar spatial objects into classes. Several
other methods are also used to perform effective and efficient clustering;
for instance, Calabrese et al. [23] and also [10], [18] used the eigengap
heuristic for clustering. Phithakkitnukoon et al. [?] identified area
profiles from POIs. Each area is connected to a main activity based on the
category of its POIs, and the activity patterns of each mobile user are
studied to determine groups that have similar activity patterns. Noulas et
al. [?] proposed an approach for modeling and characterizing geographic
areas based on the number of user check-ins and a set of 8 general (human)
activity categories in Foursquare. A cosine similarity metric is used to
measure the similarity of geographical areas, and a spectral clustering
algorithm followed by k-means clustering is applied to identify relevant
area profiles. The area profiles make it possible to understand groups of
individuals who have similar activity patterns. Similarly, social networks
[29] have been taken into account to discover the activity patterns of
individuals. Frias-Martinez et al. [7] studied geolocated tweets to
characterize urban landscapes using a complementary source of land-use and
landmark information. The authors focused on determining the land uses in
a specific urban area based on tweeting patterns, and on identifying POIs
in areas of high tweeting activity. In contrast, Yuang et al. [30]
proposed to classify urban areas based on their mobility patterns by
measuring the similarity between time series using the Dynamic Time
Warping (DTW) algorithm. Some studies focus on understanding urban
dynamics, including dense area detection and its evolution over time [?].


3 Data-source collection 

We model the contextual information of 4.6 million POIs in Trento, Italy,
and a behavioral dataset of 1 million mobile phone data records (CDR).
Table 1: Categories of human activity associated with the POIs

Top-level class of activity    Types of POIs
-----------------------------  ------------------------------------------------
eating                         fast food, food court, restaurant, cafe
shopping                       grocery, general stores
health medicine activity       hospital, pharmacy
entertainment activity         bar, casino, movie, theater
education activity             library, university, school
transportation traveling       airplane, bus, car, train
outdoor activity               sightseeing, personal care, religious places
sporting activity              car racing, summer and winter sports
working activity               professional work place, industrial place
residential activity           guest house, hotel, hostel, residential building

3.1 OpenStreetMap

For modeling the contextual description of geographical regions in Trento,
we use the High-level Representation of Behavior Model (HRBModel) [?],
which uses a spatial grid (with a cell size of 50 m x 50 m) and populates
its locations (i.e., cells) with POIs from an open geographic information
source, OpenStreetMap (OSM). The model generates a human activity
distribution map, which is enriched with the categories of human activity
associated with a likelihood measure: for example, watching a football
match in a stadium, eating in a restaurant or hiking in a forest. We
collected a total of 135,918 POIs from OSM. After cleaning the data and
discarding irrelevant POIs (i.e., those that do not reflect relevant human
activity), the number of POIs is reduced to 31,514. In total, 78,068 human
activities are extracted for these POIs. The top-level classes of activity
categories associated with the POIs are listed in Table 1.
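
As a rough illustration of how POIs can be binned into a 50 m x 50 m grid and turned into a per-cell activity distribution, consider the minimal sketch below. It is not the HRBModel itself: the function names, the projected metre coordinates and the use of relative frequency as the likelihood measure are all assumptions made for this example.

```python
from collections import Counter, defaultdict

CELL_SIZE_M = 50.0  # grid cell edge length stated in the text


def cell_index(x_m, y_m, cell_size=CELL_SIZE_M):
    """Map projected coordinates (metres) to a (column, row) grid cell."""
    return (int(x_m // cell_size), int(y_m // cell_size))


def activity_distribution(pois, cell_size=CELL_SIZE_M):
    """Return {cell: {activity: share}} from (x, y, activity) tuples.

    The share is a plain relative frequency, used here as a stand-in
    (assumption) for the likelihood measure mentioned in the text.
    """
    counts = defaultdict(Counter)
    for x_m, y_m, activity in pois:
        counts[cell_index(x_m, y_m, cell_size)][activity] += 1
    distribution = {}
    for cell, counter in counts.items():
        total = sum(counter.values())
        distribution[cell] = {a: n / total for a, n in counter.items()}
    return distribution


# Hypothetical POIs: (x in metres, y in metres, top-level activity)
pois = [
    (10.0, 20.0, "eating"),
    (30.0, 40.0, "eating"),
    (45.0, 5.0, "shopping"),
    (120.0, 60.0, "education"),
]
grid = activity_distribution(pois)
```

Here the first three POIs fall into cell (0, 0) and the last one into cell (2, 1); each cell then carries its own vector of activity shares.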

3.2 Mobile phone data records 

We collected CDRs of outgoing calls logged over 2 months. The CDRs are
completely anonymized and contain the cell ID, the time of day and the
duration of each phone call. The cell ID identifies a portion of the
physical geographic area served by a set of devices (antennas) that
support the communication. The cell coverage area is sometimes not
precisely defined, and can be temporarily modified depending on the
estimated call traffic from/to the area. The size of the coverage area is
usually inversely proportional to the density of the population inhabiting
the area. On regular territory (i.e., flat, with no mountains or other
natural irregularities), the shape of cells can be approximated by convex
polygons; otherwise, a cell can be very irregular and possibly
disconnected.
4 Methodology


We present our clustering approach for identifying relevant areas in terms
of geographical area profiles, each containing a set of categories of
human activity, and we use the observed clusters to analyze mobile phone
communication activities. Our approach aims to answer the following
questions: What are the geographical area profiles defined by human
activities in a city? How is the communication pattern affected by the
activity profile of an area? To do this, we first identify relevant area
clusters in terms of the categories of activity, and then analyze the
communication activity patterns in the observed area clusters.

4.1 Geographical area clusters 

We define a vector space model that contains a set of activity categories
corresponding to geographical areas. Geographic areas $l_i$ lie within the
territory of a city or large square area $L$. Each area $l_i$ has a vector
of the different top-level activities derived from the POIs in that area,
which serves as an input data point for identifying the relevant areas
with clustering algorithms. The area features are represented by a matrix
containing the weight of each activity category $j$ in each area $l_i$.
The relevance between areas is measured by the cosine similarity metric,
which estimates the angle between area vectors. For example, the
similarity between areas $l_1$ and $l_2$ is

$$\cos\theta_{1,2} = \frac{l_1 \cdot l_2}{\lVert l_1 \rVert \, \lVert l_2 \rVert}.$$

Given the estimated similarity between the areas, we can create a
similarity graph described by the weight matrix $W$ and the degree matrix
$D$. This graph is used by the spectral clustering algorithm, which is one
of the most popular modern clustering methods and often performs better
than traditional clustering algorithms. The K-nearest neighbors of each
data point are identified using the cosine similarity metric; from them we
create the adjacency matrix $A$ of the similarity graph and the graph
Laplacian $L = D - A$ (used in its normalized form $L_n$, for example the
symmetric normalization $L_n = D^{-1/2} L D^{-1/2}$). Based on the eigengap
heuristic, we identify the number of clusters in our dataset as

$$k = \arg\max_i \, (\lambda_{i+1} - \lambda_i),$$

where $\lambda_i \in \{\lambda_1, \lambda_2, \lambda_3, \dots, \lambda_n\}$
denotes the eigenvalues of $L_n$ in ascending order. Finally, we obtain the
clusters (area profiles) $C_1, C_2, C_3, \dots, C_k$ by applying the
k-means algorithm to the first $k$ eigenvectors.
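
The clustering steps described in this subsection can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the exact implementation used in the study: it assumes the symmetric normalized Laplacian, weights the K-nearest-neighbor edges by cosine similarity, row-normalizes the eigenvector matrix before k-means, and uses a tiny k-means with farthest-first initialization so that the example is deterministic. All function names and the toy area vectors are hypothetical.

```python
import numpy as np


def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows (area vectors) of X."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    return unit @ unit.T


def knn_adjacency(S, n_neighbors):
    """Symmetric weighted adjacency keeping each row's most similar neighbours."""
    A = np.zeros_like(S)
    for i in range(S.shape[0]):
        sims = S[i].copy()
        sims[i] = -np.inf                       # never pick the point itself
        nearest = np.argsort(sims)[-n_neighbors:]
        A[i, nearest] = S[i, nearest]
    return np.maximum(A, A.T)                   # keep an edge if either end selects it


def normalized_laplacian(A):
    """Symmetric normalization D^{-1/2} (D - A) D^{-1/2} (assumes no zero degree)."""
    degree = A.sum(axis=1)
    scale = 1.0 / np.sqrt(degree)
    return scale[:, None] * (np.diag(degree) - A) * scale[None, :]


def spectral_clusters(X, n_neighbors, n_iter=20):
    """Return (k, labels): k from the eigengap heuristic, labels from k-means."""
    S = cosine_similarity_matrix(X)
    L_n = normalized_laplacian(knn_adjacency(S, n_neighbors))
    eigvals, eigvecs = np.linalg.eigh(L_n)      # eigenvalues in ascending order
    k = int(np.argmax(np.diff(eigvals))) + 1    # largest gap lambda_{i+1} - lambda_i
    U = eigvecs[:, :k]                          # first k eigenvectors as columns
    U = U / np.linalg.norm(U, axis=1, keepdims=True)  # row-normalise (assumption)
    # Tiny k-means with farthest-first initialisation (deterministic).
    centers = [U[0]]
    while len(centers) < k:
        dists = np.min([np.linalg.norm(U - c, axis=1) for c in centers], axis=0)
        centers.append(U[int(np.argmax(dists))])
    centers = np.array(centers)
    for _ in range(n_iter):
        distances = np.linalg.norm(U[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(distances, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = U[labels == j].mean(axis=0)
    return k, labels


# Hypothetical area-by-category weights: two groups with different dominant activity
areas = np.array([
    [5, 1, 0], [4, 1, 0], [6, 0, 1],
    [0, 1, 5], [1, 0, 4], [0, 1, 6],
], dtype=float)
k, labels = spectral_clusters(areas, n_neighbors=2)
```

In this toy input the neighbor graph splits into two disconnected triangles, so the normalized Laplacian has two near-zero eigenvalues and the eigengap heuristic selects two clusters.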
4.2 Behavioral pattern extraction in area clusters

We are interested in contextualizing the temporal patterns of
communication activity in relevant area clusters, in order to determine
standard or exceptional types of communication activity based on the
variations in communication activity across different time contexts. This
could be done by overlaying the human activity distribution map and the
cell coverage map. In this analysis, however, the cell coverage map is
unavailable and the locations of cell towers are given only approximately.
We therefore divide the study area into Voronoi polygons based on the
spatial distribution of the cell phone towers. Figure 1(a) shows the
Voronoi polygons used to visualize the cell coverage map. We are then able
to extract the mobile communication activities in each polygon, treated as
a coverage area. To extract mobile communication activities in the
observed area clusters, we need to associate each Voronoi polygon with the
human activity distribution map. A given cell area $p$ may intersect
multiple areas $l_i$, which are represented as a list of areas with an
intersection weight in $[0, 1]$. The intersection weight is estimated as
the ratio of the sizes of area $l_i$ and cell $p$,

$$W(p, l_i) = \frac{|l_i|}{|p|},$$

where $|\cdot|$ denotes area size. The number of calls per area is the
number of calls in a given cell $p$ at a certain time $t$, written
$X(p, t)$, divided by the number of intersecting areas $N$ and multiplied
by the intersection weight $W_i = W(p, l_i)$:

$$X(p, t, l_i) = \frac{X(p, t)}{N} \cdot W_i.$$
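
The two steps above, assigning locations to Voronoi cells and distributing a cell's calls over the areas it intersects, can be sketched as follows. The sketch assumes that Voronoi membership is decided by the nearest tower, that the intersection weight is the ratio of area size to cell size (clipped to 1), and that calls are split as (calls / N) x weight. The tower names, coordinates and sizes are hypothetical.

```python
import math


def nearest_tower(point, towers):
    """Voronoi membership: a point belongs to the cell of its nearest tower."""
    return min(towers, key=lambda name: math.dist(point, towers[name]))


def intersection_weight(area_size, cell_size):
    """W(p, l_i): size of area l_i relative to cell p, clipped to [0, 1]."""
    return min(1.0, area_size / cell_size)


def calls_per_area(calls_in_cell, weights):
    """X(p, t, l_i) = (calls in p at t / N) * W_i for each intersecting area."""
    n_areas = len(weights)
    return {area: calls_in_cell / n_areas * w for area, w in weights.items()}


# Hypothetical tower locations in metres
towers = {"tower_A": (0.0, 0.0), "tower_B": (100.0, 100.0)}
owner = nearest_tower((10.0, 10.0), towers)

# Hypothetical Voronoi cell of 10,000 m^2 intersecting two 2,500 m^2 areas
cell_size = 10_000.0
weights = {
    "l_1": intersection_weight(2_500.0, cell_size),
    "l_2": intersection_weight(2_500.0, cell_size),
}
allocated = calls_per_area(120, weights)
```

With 120 calls in the cell, two intersecting areas and a weight of 0.25 each, every area is credited with 15 calls under these assumptions.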




(a) The density of communication activity distribution over a typical day

(b) The density of communication activity distribution over Easter Sunday

Figure 1: The density of communication activity distribution, where black
represents high-volume activity and white represents low-volume activity
Figure 2: Geo-visualization of area classifications (k = 17), Trento, Italy

We are now able to extract mobile communication activity patterns in
different area clusters. To identify the normal (typical) temporal pattern
of communication activity, we exclude the communication activities on
specific days when public holidays or festivals occur, such as Liberation
Day, Palm Sunday and the Easter holiday (see the example for Easter in
Figure 1(b)), because communication activity can change significantly on
those days. To estimate the variation boundaries of typical communication
activity over different time contexts in a given area cluster, we use a
sigma approach, which lets us determine exceptional (divergent) temporal
patterns of communication activity in each area cluster:

$$X'(c, t) = \mu_{X(c,t)} \pm a \cdot \sigma_{X(c,t)},$$

where $\mu_{X(c,t)}$ and $\sigma_{X(c,t)}$ are the mean and standard
deviation of the typical communication activity in cluster $c$ at time $t$.
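
A minimal sketch of the sigma approach is given below: it computes, for each time slot, the band of typical activity as mean plus or minus a times the standard deviation, and flags the slots of a given day that fall outside it. The use of the population standard deviation, the function names and the call counts are assumptions for illustration only.

```python
from statistics import mean, pstdev


def typical_bounds(history, a=3.0):
    """Per time slot, return (lower, upper) = mean -/+ a * std over typical days."""
    bounds = []
    for values in zip(*history):            # one tuple of observations per slot
        mu, sigma = mean(values), pstdev(values)
        bounds.append((mu - a * sigma, mu + a * sigma))
    return bounds


def exceptional_slots(day, bounds):
    """Indices of time slots whose activity falls outside the typical bounds."""
    return [t for t, (x, (lo, hi)) in enumerate(zip(day, bounds))
            if x < lo or x > hi]


# Hypothetical call counts for three time slots on four typical Sundays
typical_sundays = [
    [10, 50, 30],
    [12, 54, 28],
    [8, 46, 32],
    [10, 50, 30],
]
bounds = typical_bounds(typical_sundays, a=3.0)

# Hypothetical event day with an unusual peak in the second slot
event_day = [11, 90, 29]
flagged = exceptional_slots(event_day, bounds)
```

Only the second slot lies outside its band here, so it is the one reported as exceptional.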

5 Experimental Results and Discussion 

We concentrate on identifying relevant areas, and show how the semantics
of human activities can be used to interpret standard or exceptional
temporal patterns of communication activity. We observed 17 clusters (area
profiles) according to the activity vector of each area, as shown in
Figure 2. For each cluster, we show the weight vector of customer
activities, as described in Figure 3. The central part of the city is
clustered into $C_9$ and $C_1$, followed by $C_{12}$ and $C_{10}$, where
entertainment, residential, shopping, sporting and traveling-by-transport
activities are highly distributed. We then extract the temporal patterns
of communication activity in order to see how communication behavior is
affected by the types of customer activity. The overall average
communication activity density per day for each cluster varies depending
on the category of the customer activities, as shown in Figure 4. $C_1$ is
the most active cluster in terms of mobile phone communication activity.
The weekly temporal variations of communication activity per cluster are
similar to each other; see Figure 5.

Figure 3: The scaled weight of activity categories of each cluster

There is also clearly one cluster, $C_{11}$ (purple, traveling by
transport), which is more active on weekdays than other clusters and less
active on weekends, and another, $C_1$ (light blue, outdoor activity),
which shows the opposite pattern to $C_{11}$. These patterns are
highlighted in the subplot of Figure 5, where the total percentage of
weekend activity is reported. The cluster patterns differ in terms of time
context (e.g., hour of the day and day of the week).

The daily communication activity timelines for weekdays, Saturday and
Sunday are shown in Figures 6(a), 6(b) and 6(c), respectively. They
generally show quite similar patterns for the different clusters. Based on
the Euclidean distance metric, $C_{11}$ is the cluster whose pattern is
most distinct from the average communication activity pattern over
weekdays, Saturday and Sunday.
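
The comparison against the average pattern can be illustrated with a short sketch: compute the mean timeline over all clusters, then rank clusters by their Euclidean distance to it. The cluster names and the three-slot patterns below are hypothetical and do not correspond to the clusters in the study.

```python
import math


def average_pattern(patterns):
    """Element-wise mean of all cluster timelines."""
    n = len(patterns)
    return [sum(values) / n for values in zip(*patterns.values())]


def most_distinct_cluster(patterns):
    """Return the cluster farthest (Euclidean) from the average, plus all distances."""
    avg = average_pattern(patterns)
    distances = {name: math.dist(p, avg) for name, p in patterns.items()}
    return max(distances, key=distances.get), distances


# Hypothetical normalised activity shares over three time slots
patterns = {
    "C_a": [0.2, 0.5, 0.3],
    "C_b": [0.2, 0.6, 0.2],
    "C_c": [0.6, 0.1, 0.3],
}
name, distances = most_distinct_cluster(patterns)
```

In this toy input, the third pattern peaks in a different slot from the other two and is therefore the farthest from the average.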

Figure 4: Overall average communication activity density (log scaled) per
day with the actual values

Figure 5: Communication activity temporal variations by day of the week;
the subplot shows the overall average communication activity density over
the weekend

(a) Daily communication activity pattern per cluster over weekdays. The
cluster patterns are more diverse over the weekend, while the patterns are
almost the same over weekdays

(b) Daily communication activity pattern per cluster over Saturday. In the
pattern of $C_{11}$, there is an activity peak at 3:00 am on Saturday,
which suggests that late on Friday night people take transportation to go
home

(c) Daily communication activity pattern per cluster over Sunday. In the
pattern of cluster $C_{14}$, there is a peak at 1 am that might result
from traveling-by-transport and working activity

Figure 6: Typical communication activity temporal patterns over weekdays
vs the weekend; clusters $C_{11}$ and $C_{14}$ are the most distinct
compared with the average communication activity pattern over weekdays,
Saturday and Sunday

To assess clustering quality when the ground truth of a dataset is not
available, we have to use an intrinsic method. We evaluate the clusters by
examining how well they are separated and how compact they are, based on a
similarity metric (the silhouette coefficient) between objects in the
dataset. Formally, suppose $o \in C_i$ ($1 \le i \le k$); then

$$a(o) = \frac{\sum_{o' \in C_i,\, o' \ne o} \mathrm{dist}(o, o')}{|C_i| - 1},$$

where $a(o)$ is the average distance between $o$ and all other objects in
the cluster to which $o$ belongs. Similarly, $b(o)$ is the minimum average
distance from $o$ to all clusters to which $o$ does not belong:

$$b(o) = \min_{C_j:\, 1 \le j \le k,\, j \ne i} \left\{ \frac{\sum_{o' \in C_j} \mathrm{dist}(o, o')}{|C_j|} \right\}.$$

The silhouette coefficient lies between $-1$ and $1$ and is estimated by

$$s(o) = \frac{b(o) - a(o)}{\max\{a(o), b(o)\}}.$$

A positive value indicates that the cluster is compact and well separated
from other clusters. When the silhouette coefficient is negative, however,
the object is closer to the objects in another cluster than to those in
its own cluster. From the temporal patterns of communication activity over
the days of the week, we estimated the silhouette coefficients of each
area in all clusters based on the Euclidean distance measure. The
transport activity cluster $C_{11}$ has an estimated clustering quality of
67%, with positive silhouette coefficients for the objects in the cluster.
This means that the cluster is compact and well separated from the other
clusters. The quality measure increases to 77% when we estimate the
coefficient from the temporal patterns of communication activity over the
whole time period. We selected this cluster for further analysis because
we expect traveling to be largely affected by special events, and because
the cluster is a distinctive one (in terms of the deviation of its
timeline from the average) that also has a good quality measure.
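
The silhouette coefficient described above can be computed directly from its definition, as in the following sketch with Euclidean distance. The helper `share_positive`, which reports the fraction of objects with a positive coefficient, is only one possible reading of the "clustering quality" percentage and is an assumption, as are the toy points and labels.

```python
import math


def silhouette(index, points, labels):
    """s(o) = (b(o) - a(o)) / max(a(o), b(o)) for the object at `index`."""
    own = labels[index]
    o = points[index]
    same = [math.dist(o, p) for i, p in enumerate(points)
            if labels[i] == own and i != index]
    if not same:
        return 0.0                      # singleton cluster: convention (assumption)
    a = sum(same) / len(same)           # mean distance within the own cluster
    b = min(                            # smallest mean distance to another cluster
        sum(math.dist(o, p) for i, p in enumerate(points) if labels[i] == other)
        / labels.count(other)
        for other in set(labels) if other != own
    )
    denom = max(a, b)
    return 0.0 if denom == 0 else (b - a) / denom


def share_positive(points, labels):
    """Fraction of objects whose silhouette coefficient is positive."""
    scores = [silhouette(i, points, labels) for i in range(len(points))]
    return sum(s > 0 for s in scores) / len(scores)


# Hypothetical two-dimensional patterns forming two tight groups
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
good_labels = [0, 0, 1, 1]
bad_labels = [0, 0, 0, 1]               # third point placed in the wrong cluster

s_good = silhouette(0, points, good_labels)
s_bad = silhouette(2, points, bad_labels)
quality = share_positive(points, good_labels)
```

With the correct labels every object has a positive coefficient, whereas the misplaced third point in `bad_labels` receives a negative one because it lies closer to the other cluster.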

We further investigate exceptional (divergent) temporal patterns over
public events to determine how much the communication activity deviates
from the variations of typical communication activity, using the sigma
approach. Figure 7 shows the changes in the temporal pattern of
communication activity over Easter Sunday and Palm Sunday. Easter Sunday
has a great impact on human behavior, as there is an activity peak on the
morning of Easter Sunday.


Figure 7: The daily temporal communication activity pattern of $C_{11}$
over a typical Sunday compared with the communication activity patterns
over Easter Sunday and Palm Sunday. There is an activity peak on the
morning of Easter Sunday that exceeds the typical variations of the
communication activity, a = 3
6 Conclusion


In this paper, we proposed an approach to identify relevant area clusters
in terms of the categories of human activity (e.g., working, shopping or
entertainment areas). The area clusters are used to contextualize the
temporal patterns of mobile communication activity with the categories of
human activity. An intrinsic method is used to assess clustering quality,
that is, whether a cluster is compact and well separated from other
clusters. It turns out that communication activity, namely its density
and temporal variation, is largely affected by the context of human
activity in the area. The transport activity cluster $C_{11}$ is well
classified, with a clustering quality of 77%. Using these area clusters,
we explain typical or exceptional (divergent) types of mobile
communication activity. In future work, we will evaluate the approach in
different cities and measure the relation between other types of human
activity and mobile communication activity. The results of this research
are potentially useful for more coherent classification of human behaviors
and for better understanding the relationship between human behaviors,
environmental factors and their dynamics in real-life social phenomena.